Classification

Advanced Analytics with R (UG 21-24)

Ayush Patel

Before we start

Please load the following packages

library(tidyverse)
library(MASS)
library(ISLR)
library(ISLR2)
library(nnet)  # install this package if you don't have it
library(e1071) # install this package if you don't have it



Access the lecture slides at bit.ly/aar-ug

Warrior's armor(gusoku)
Source: Armor (Gusoku)

Hello

I am Ayush.

I am a researcher working at the intersection of data, law, development and economics.

I teach Data Science using R at Gokhale Institute of Politics and Economics.

I am an RStudio (Posit) certified tidyverse Instructor.

I am a Researcher at the Oxford Poverty and Human Development Initiative (OPHI), at the University of Oxford.

Reach me

ayush.ap58@gmail.com

ayush.patel@gipe.ac.in

Learning Objective

Dip our toes into classification techniques: how to apply these methods and how to assess them.

References for this lecture:

  • Chapter 4, ISLR (reference)
  • Chapter 9, Intro to Modern Statistics (reading for intuitive understanding)
  • Chapter 10.2, Modern Data Science with R

What is Classification?

  • Predict a qualitative (categorical) response.
  • Predicting a qualitative response for an observation is called classification.
  • A method or technique that does this is referred to as a classifier.
  • We will look into: logistic regression, linear discriminant analysis, quadratic discriminant analysis, naive Bayes and K-nearest neighbours.

What actually happens

… often the methods used for classification first predict the probability that the observation belongs to each of the categories of a qualitative variable, as the basis for making the classification. In this sense they also behave like regression methods.

Why not use linear regression?

  • Nominal categorical variables have no rank, so there is no natural way to assign them quantitative values.
  • For ordinal variables, the distances between levels are not easy to assign.
  • With a response that has only two levels, a 0/1 dummy coding can work to some extent.
  • Even then, there is no guarantee that the estimates will lie within [0,1], which makes them hard to interpret as probabilities.

Default data

default  student    balance     income
No       No        729.5265  44361.625
No       Yes       817.1804  12106.135
No       No       1073.5492  31767.139
No       No        529.2506  35704.494
No       No        785.6559  38463.496
No       Yes       919.5885   7491.559
No       No        825.5133  24905.227
No       Yes       808.6675  17600.451
No       No       1161.0579  37468.529
No       No          0.0000  29275.268
No       Yes         0.0000  21871.073
No       Yes      1220.5838  13268.562
No       No        237.0451  28251.695
No       No        606.7423  44994.556
No       No       1112.9684  23810.174
No       No        286.2326  45042.413
No       No          0.0000  50265.312
No       Yes       527.5402  17636.540
No       No        485.9369  61566.106
No       No       1095.0727  26464.631

Logistic Regression

  • Logistic regression is well suited to qualitative binary responses.
  • The default variable from Default is our response (\(Y\)).
  • It has two levels, Yes or No.
  • We model the probability that \(Y\) belongs to a particular category.
  • The logistic model estimates \(Pr(default = Yes|balance)\), also referred to as \(p(balance)\).
  • To turn the probability into a prediction, we predict default = Yes when \(p(balance) > a\), where \(0 \le a \le 1\) is a threshold chosen according to how risk averse we are.

But what if?

I ran this: \(p(balance) = \beta_0 + \beta_1X\)

## make a dummy for default

Default|>
  mutate(
    default_dumm = ifelse(
      default == "Yes",
      1,0
    )
  )-> def_dum

## regress dummy over balance and plot 

lm(default_dumm ~ balance, 
   data = def_dum)|>
  broom::augment()|>
  ggplot(aes(balance,default_dumm))+
  geom_point(alpha= 0.6)+
  geom_line(aes(balance, .fitted),
            colour = "red")+
  labs(
    title = "Linear regression fit to qualitative response",
    subtitle = "Yes =1, No = 0",
    y = "prob default status"
  )+
  theme_minimal() -> plot_linear

## Run the logistic regression

glm(
  default_dumm ~ balance,
  data = def_dum,
  family = binomial
)|>
  broom::augment(type.predict = "response")|>
  ggplot(aes(balance,default_dumm))+
  geom_point(alpha= 0.6)+
  geom_line(aes(balance, .fitted),
            colour = "red")+
  labs(
    title = "Logistic regression fit to qualitative response",
    subtitle = "Yes =1, No = 0",
    y = "prob default status"
  )+
  theme_minimal() -> logistic_plot

Logistic Model

We saw that some fitted values in the linear model were negative.

We need a function that will return values within [0,1].

\[p(X) = \frac{e^{\beta_0 + \beta_1X}}{1+e^{\beta_0 + \beta_1X}}\]

This is the logistic function. Its coefficients are estimated by the maximum likelihood method.

Odds:

\[\frac{p(X)}{1-p(X)}\]

Log odds or logit:

\[\log\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1X\]
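As a quick numerical illustration of the logistic function, the odds and the logit, the sketch below plugs in the coefficients reported in this lecture for the balance-only fit to Default. They are treated as given inputs and are not re-estimated here. The balances of $1000 and $2000, the threshold of 0.5 and the helper name p_default are choices made for this example only.

```r
# Coefficients of the balance-only logistic model, copied from the
# fitted results reported in this lecture (inputs, not re-estimated here)
beta_0 <- -10.651330614
beta_1 <- 0.005498917

# p_default() is a hypothetical helper: the logistic function p(X)
p_default <- function(balance) {
  eta <- beta_0 + beta_1 * balance   # log odds (logit)
  exp(eta) / (1 + exp(eta))          # same value as plogis(eta)
}

balance <- c(1000, 2000)             # illustrative balances
p <- p_default(balance)
odds <- p / (1 - p)

data.frame(balance, probability = p, odds, log_odds = log(odds))

# classify with a threshold a = 0.5 (an arbitrary choice)
ifelse(p > 0.5, "Yes", "No")
```

The log odds are linear in balance while the probability is not: at $1000 the log odds are negative, so the probability is below one half, and at $2000 they are positive, so the probability is above one half.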

Exercise - concept

If the following are the results of the model \(logit(p(default)) = \beta_0 + \beta_1Balance\):

term              estimate     std.error  statistic        p.value
(Intercept)  -10.651330614  0.3611573721  -29.49221  3.623124e-191
balance        0.005498917  0.0002203702   24.95309  1.976602e-137

What is the probability of default for a balance of $5000?

Multiple logistic Regression

\[p(X) = \frac{e^{\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_pX_p}}{1+e^{\beta_0 + \beta_1X_1 + \beta_2X_2+...+\beta_pX_p}}\]

Model with income, balance and student as predictors:

term              estimate     std.error   statistic        p.value
(Intercept)  -1.086905e+01  4.922555e-01  -22.080088  4.911280e-108
income        3.033450e-06  8.202615e-06    0.369815   7.115203e-01
balance       5.736505e-03  2.318945e-04   24.737563  4.219578e-135
studentYes   -6.467758e-01  2.362525e-01   -2.737646   6.188063e-03

Model with student as the only predictor:

term           estimate   std.error   statistic       p.value
(Intercept)  -3.5041278  0.07071301  -49.554219  0.0000000000
studentYes    0.4048871  0.11501883    3.520181  0.0004312529

Notice that the sign of the studentYes coefficient differs between the two models.

How to know if it's good?

There is no consensus in the statistics community on a single measure of goodness of fit for logistic regression.

glm(
  default_dumm ~ income + balance + student,
  data = def_dum,
  family = binomial
) -> mod_logit

DescTools::PseudoR2(mod_logit,
                    which = c("McFadden", "CoxSnell",
                              "Nagelkerke", "Tjur"))
  McFadden   CoxSnell Nagelkerke       Tjur 
 0.4619194  0.1262059  0.4982860  0.3355203 
AIC(mod_logit) # be careful with this
[1] 1579.545
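To make one of these measures concrete, the sketch below computes McFadden's pseudo-\(R^2\) by hand as one minus the ratio of the log-likelihood of the fitted model to that of an intercept-only model. It uses simulated data rather than Default so that it runs with base R alone; the variable names and simulation settings are assumptions made for illustration.

```r
set.seed(42)

# simulated stand-in data: one predictor, binary response
n <- 500
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-1 + 2 * x))
sim <- data.frame(x, y)

mod_full <- glm(y ~ x, data = sim, family = binomial)
mod_null <- glm(y ~ 1, data = sim, family = binomial)  # intercept only

# McFadden's pseudo R-squared: 1 - logLik(model) / logLik(null model)
mcfadden <- 1 - as.numeric(logLik(mod_full)) / as.numeric(logLik(mod_null))
mcfadden
```

A value near 0 means the predictors add little over the intercept-only model. Unlike the \(R^2\) of linear regression, it should not be read as a share of variance explained.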

Exercise

Use the Credit data in {ISLR2} (this version of the data has the Region variable).

  • Create three new variables:
    • One: mark_south (1 if Region is South, else 0)
    • Two: mark_west (1 if Region is West, else 0)
    • Three: mark_east (1 if Region is East, else 0)
  • Create three binomial logistic models, one for each newly created variable.
  • Get the pseudo-\(R^2\) for each model.

What you just did is called a stratified binary model.

  • n models are created to understand the probabilities related to the n levels of the categorical response variable.
  • The n models are not comparable with each other.
  • The relative probabilities amongst the n levels of the response are not known.

Relative Risk or Baseline Approach to Multinomial Logistic Regression

\[Pr(Y=k|X=x) = \frac{e^{\beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p}}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}\]

for \(k = 1, ..., K-1\), and

\[Pr(Y=K|X=x) = \frac{1}{1+\sum_{l=1}^{K-1}e^{\beta_{l0}+\beta_{l1}x_1+...+\beta_{lp}x_p}}\]

Multinomial Logistic

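\[\log\left(\frac{Pr(Y=k|X=x)}{Pr(Y=K|X=x)}\right) = \beta_{k0}+\beta_{k1}x_1+...+\beta_{kp}x_p\]

  • Which class is treated as the reference or baseline does not affect the fitted probabilities, but every coefficient is a log odds relative to that baseline.

  • How to interpret this?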
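One way to build intuition is to evaluate the baseline formulas by hand. The sketch below does this for K = 3 classes and p = 2 predictors. The coefficient values, the observation x and the class names are all made up for illustration and do not come from any fitted model.

```r
# Hypothetical coefficients for K = 3 classes and p = 2 predictors.
# Rows are the K - 1 non-baseline classes; columns are intercept, x1, x2.
beta <- rbind(
  class_1 = c(0.5, 1.2, -0.7),
  class_2 = c(-1.0, 0.3, 0.9)
)

x <- c(1, 0.4, 2.0)          # leading 1 multiplies the intercept

eta <- as.vector(beta %*% x) # log odds of each class versus the baseline
denom <- 1 + sum(exp(eta))

probs <- c(exp(eta) / denom, 1 / denom)
names(probs) <- c("class_1", "class_2", "baseline")
probs

# log odds relative to the baseline recover the linear predictors
log(probs[c("class_1", "class_2")] / probs["baseline"])
```

The three probabilities sum to one, and each coefficient shifts the log odds of its class against the baseline class, not against the other non-baseline class.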

Data

Rows: 344
Columns: 8
$ species           <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel…
$ island            <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse…
$ bill_length_mm    <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, …
$ bill_depth_mm     <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, …
$ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186…
$ body_mass_g       <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, …
$ sex               <fct> male, female, female, NA, female, male, female, male…
$ year              <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007…

Set Baseline/reference

palmerpenguins::penguins|>
  mutate(
    species = stats::relevel(species,
                             ref = "Gentoo")
  ) -> peng_ref

levels(peng_ref$species)
[1] "Gentoo"    "Adelie"    "Chinstrap"

Describe model

multi_log <- nnet::multinom(
  formula = species ~ body_mass_g + bill_length_mm + bill_depth_mm + flipper_length_mm + sex + island, 
  data = peng_ref
)
# weights:  27 (16 variable)
initial  value 365.837892 
iter  10 value 21.914358
iter  20 value 1.629266
iter  30 value 0.026372
final  value 0.000049 
converged

Peek into Summary - notice anything?

Call:
nnet::multinom(formula = species ~ body_mass_g + bill_length_mm + 
    bill_depth_mm + flipper_length_mm + sex + island, data = peng_ref)

Coefficients:
          (Intercept) body_mass_g bill_length_mm bill_depth_mm
Adelie       502.6573 -0.08755830     -20.075027      34.82987
Chinstrap   -434.3867 -0.02106537       6.332771     -16.48865
          flipper_length_mm   sexmale islandDream islandTorgersen
Adelie            0.5054518  33.23469    62.03886        144.9809
Chinstrap         1.7645190 -55.22699   335.85058         63.1425

Std. Errors:
          (Intercept) body_mass_g bill_length_mm bill_depth_mm
Adelie      0.5314853    2.351402       29.93540      5.286822
Chinstrap   0.5310960    4.080649       29.91681      5.278463
          flipper_length_mm   sexmale islandDream islandTorgersen
Adelie             49.88305 0.2294146    0.531096    4.701009e-47
Chinstrap          49.81079 0.2290253    0.531096   4.261135e-130

Residual Deviance: 9.874339e-05 
AIC: 32.0001 

Getting p-values

# calculate z-statistics of coefficients
z_stats <- summary(multi_log)$coefficients/
  summary(multi_log)$standard.errors

# convert to p-values
p_values <- (1 - pnorm(abs(z_stats)))*2


# display p-values in transposed data frame
data.frame(t(p_values))
                        Adelie   Chinstrap
(Intercept)       0.000000e+00 0.000000000
body_mass_g       9.702963e-01 0.995881131
bill_length_mm    5.024680e-01 0.832357200
bill_depth_mm     4.456258e-11 0.001785562
flipper_length_mm 9.919154e-01 0.971741303
sexmale           0.000000e+00 0.000000000
islandDream       0.000000e+00 0.000000000
islandTorgersen   0.000000e+00 0.000000000

Fitted values

           Gentoo        Adelie     Chinstrap
1   1.565008e-135  1.000000e+00 1.009721e-242
2    3.833780e-97  1.000000e+00 1.450741e-166
3   3.913549e-122  1.000000e+00 1.006490e-181
5   3.854489e-165  1.000000e+00 2.652195e-247
6   2.628864e-168  1.000000e+00 9.671388e-281
7   5.558841e-114  1.000000e+00 3.782674e-190
8   3.576898e-116  1.000000e+00 2.880335e-227
13  3.717985e-108  1.000000e+00 3.470609e-172
14  5.313520e-178  1.000000e+00 2.906555e-297
15  4.397228e-190  1.000000e+00 9.358591e-320
16  4.638659e-132  1.000000e+00 3.570151e-212
17  1.323762e-143  1.000000e+00 1.383387e-216
18  3.906700e-111  1.000000e+00 6.765722e-218
19  2.342237e-174  1.000000e+00 3.742224e-262
20  1.803219e-103  1.000000e+00 6.877397e-206
21   3.440325e-75  1.000000e+00 1.084873e-188
22   2.942167e-90  1.000000e+00 4.087688e-228
23   1.885524e-93  1.000000e+00 8.693837e-211
24   1.304105e-64  1.000000e+00 3.625997e-196
25   2.258588e-50  1.000000e+00 2.714209e-176
26   1.050278e-93  1.000000e+00 4.472482e-212
27   5.070290e-66  1.000000e+00 1.977702e-192
28   4.657299e-56  1.000000e+00 1.773975e-147
29   6.353727e-88  1.000000e+00 1.524263e-202
30   1.459617e-55  1.000000e+00 2.364103e-190
31   1.084357e-69  1.000000e+00  9.173878e-17
32  1.227459e-100  1.000000e+00  5.420913e-94
33   1.265506e-86  1.000000e+00  2.282452e-34
34   8.494378e-82  1.000000e+00  4.163335e-66
35  3.892077e-102  1.000000e+00  1.530044e-47
36  5.034347e-123  1.000000e+00 7.437468e-121
37  3.676963e-116  1.000000e+00 5.542033e-110
38   2.071518e-62  1.000000e+00  3.696922e-16
39  2.421798e-124  1.000000e+00  2.037458e-93
40   6.809138e-66  1.000000e+00  1.601284e-61
41  3.425600e-120  1.000000e+00  7.624079e-81
42   1.606314e-77  1.000000e+00  4.276671e-50
43  6.806524e-135  1.000000e+00  5.590874e-97
44   1.276361e-49  1.000000e+00  3.090448e-26
45  1.481080e-105  1.000000e+00  2.763406e-53
46   2.564070e-66  1.000000e+00  2.715698e-55
47   3.442093e-99  1.000000e+00  7.482183e-81
49  2.187000e-113  1.000000e+00  2.595632e-71
50   2.061308e-96  1.000000e+00  2.896132e-90
51   2.979432e-49  1.000000e+00 3.169427e-145
52   1.697699e-47  1.000000e+00 1.852417e-180
53   3.668664e-95  1.000000e+00 1.072902e-201
54   3.787966e-52  1.000000e+00 1.067250e-172
55  8.398157e-123  1.000000e+00 2.068522e-229
56   4.242336e-55  1.000000e+00 1.503984e-174
57   1.477876e-49  1.000000e+00 3.319481e-146
58   9.797629e-62  1.000000e+00 3.358364e-184
59   2.927841e-83  1.000000e+00 9.109545e-178
60   1.502988e-94  1.000000e+00 3.440099e-226
61   3.045152e-84  1.000000e+00 8.887102e-183
62   4.785268e-68  1.000000e+00 5.169778e-209
63   4.444320e-52  1.000000e+00 3.202953e-150
64   1.419865e-38  1.000000e+00 2.020648e-158
65   2.360178e-92  1.000000e+00 2.038745e-188
66   5.421996e-35  1.000000e+00 4.069958e-151
67   5.476171e-70  1.000000e+00 3.160205e-158
68   2.078893e-49  1.000000e+00 3.188062e-179
69  7.954905e-146  1.000000e+00 1.710102e-209
70   1.077474e-99  1.000000e+00 7.562552e-198
71  3.865774e-182  1.000000e+00 1.261787e-274
72  4.918193e-122  1.000000e+00 6.676413e-220
73  6.012949e-105  1.000000e+00 1.034230e-162
74   1.910730e-68  1.000000e+00 4.867732e-150
75  3.284153e-138  1.000000e+00 2.276590e-215
76   2.618102e-84  1.000000e+00 9.775178e-174
77   9.229092e-81  1.000000e+00 2.732396e-137
78  1.219806e-157  1.000000e+00 3.841097e-274
79  5.636064e-116  1.000000e+00 4.127281e-182
80  5.403569e-109  1.000000e+00 2.345329e-202
81  2.595345e-160  1.000000e+00 5.445574e-234
82   6.249382e-53  1.000000e+00 5.461387e-139
83  5.962076e-143  1.000000e+00 2.474997e-229
84  1.620648e-166  1.000000e+00 1.214968e-284
85  1.460100e-104  1.000000e+00  1.627128e-56
86  5.435650e-115  1.000000e+00  2.320514e-97
87  4.252100e-136  1.000000e+00 7.652729e-132
88  5.233585e-114  1.000000e+00  1.076593e-75
89  3.364375e-108  1.000000e+00  1.960486e-98
90   5.170773e-96  1.000000e+00  8.844224e-54
91  2.399353e-116  1.000000e+00  1.564235e-67
92   2.369104e-57  1.000000e+00  5.984160e-23
93  1.586023e-119  1.000000e+00  1.344923e-80
94   1.484469e-60  1.000000e+00  3.282288e-46
95  1.300253e-107  1.000000e+00  1.283398e-61
96   9.985286e-73  1.000000e+00  1.402530e-42
97   3.686971e-96  1.000000e+00  1.308798e-55
98   1.683356e-66  1.000000e+00  1.620603e-44
99  1.008476e-129  1.000000e+00  6.726151e-87
100  7.608280e-50  1.000000e+00  1.154731e-20
101  3.825182e-85  1.000000e+00 1.162753e-192
102  2.022024e-43  1.000000e+00 3.536458e-174
103  1.322537e-55  1.000000e+00 4.846989e-143
104  1.577519e-86  1.000000e+00 1.054969e-231
105 4.338556e-101  1.000000e+00 1.474158e-197
106  1.261728e-78  1.000000e+00 6.837177e-209
107  9.369108e-44  1.000000e+00 3.189705e-131
108  2.378118e-96  1.000000e+00 3.188707e-237
109  5.296813e-63  1.000000e+00 6.022331e-159
110  6.780156e-06  9.999932e-01 1.699143e-128
111  1.872526e-34  1.000000e+00 9.761655e-120
112  5.672252e-10  1.000000e+00 2.800928e-138
113  2.520868e-61  1.000000e+00 6.490000e-149
114  3.439657e-41  1.000000e+00 1.510222e-165
115  1.613357e-80  1.000000e+00 8.393870e-198
116  4.589386e-26  1.000000e+00 2.167375e-139
117 1.333710e-133  1.000000e+00 7.219327e-193
118 1.875383e-181  1.000000e+00 6.421625e-293
119 5.416429e-142  1.000000e+00 1.382857e-212
120 1.686160e-134  1.000000e+00 1.870521e-225
121 7.970359e-148  1.000000e+00 3.536496e-218
122 1.292156e-177  1.000000e+00 3.222353e-281
123  4.198999e-96  1.000000e+00 3.384342e-165
124 2.604573e-112  1.000000e+00 8.559222e-197
125 1.837631e-140  1.000000e+00 4.157304e-204
126 1.947839e-121  1.000000e+00 3.828437e-215
127 2.478227e-127  1.000000e+00 1.775509e-191
128  1.024613e-90  1.000000e+00 9.595811e-183
129 1.396937e-126  1.000000e+00 1.546359e-184
130  3.280699e-78  1.000000e+00 1.061450e-146
131 2.298752e-132  1.000000e+00 1.046066e-200
132 3.065056e-121  1.000000e+00 1.841132e-206
133 3.030640e-114  1.000000e+00  2.000481e-72
134  8.112726e-87  1.000000e+00  2.224610e-71
135  7.842629e-91  1.000000e+00  6.644925e-43
136  3.409778e-60  1.000000e+00  2.490694e-29
137 1.683988e-121  1.000000e+00  2.224177e-74
138 1.033267e-106  1.000000e+00  5.770265e-90
139  2.701357e-84  1.000000e+00  8.079601e-33
140  8.450468e-66  1.000000e+00  1.488202e-42
141  3.152874e-67  9.999593e-01  4.068580e-05
142  1.617609e-75  1.000000e+00  2.721354e-42
143 7.409834e-126  1.000000e+00  3.397055e-75
144  8.994604e-63  1.000000e+00  7.923681e-28
145 5.783065e-103  1.000000e+00  8.678595e-44
146 4.605532e-105  1.000000e+00  4.108604e-90
147  4.342044e-80  1.000000e+00  1.572962e-65
148 1.885663e-113  1.000000e+00  3.916754e-78
149 5.687431e-113  1.000000e+00  2.382362e-66
150 2.105977e-104  1.000000e+00  3.060488e-83
151  4.036940e-91  1.000000e+00  6.654021e-48
152  1.912664e-70  1.000000e+00  3.972737e-38
153  1.000000e+00 1.772889e-109  1.371973e-36
154  1.000000e+00 1.293197e-123  1.826463e-68
155  1.000000e+00 7.520028e-117  3.422658e-36
156  1.000000e+00 6.893219e-143  8.765637e-70
157  1.000000e+00 8.385801e-122  6.314732e-71
158  1.000000e+00 1.507993e-110  7.335537e-39
159  1.000000e+00  1.320615e-93  2.768662e-51
160  1.000000e+00  2.262479e-93  3.099923e-74
161  1.000000e+00  1.120134e-78  2.435404e-46
162  1.000000e+00  1.043816e-91  2.769695e-77
163  1.000000e+00  1.266441e-61  1.520768e-53
164  1.000000e+00 2.726199e-115  3.866374e-79
165  1.000000e+00 9.945072e-102  6.813724e-41
166  1.000000e+00 8.140272e-145  4.315103e-75
167  1.000000e+00  1.696385e-74  1.841659e-45
168  1.000000e+00 3.811851e-135  1.988430e-77
169  1.000000e+00  4.187589e-56  1.407891e-47
170  1.000000e+00 4.529948e-158  3.567558e-75
171  1.000000e+00 1.564320e-102  6.698092e-50
172  1.000000e+00 7.030489e-119  2.242786e-66
173  1.000000e+00 3.026554e-158  8.663190e-63
174  1.000000e+00  3.137029e-99  3.705486e-50
175  1.000000e+00  4.648131e-89  2.375477e-42
176  1.000000e+00  1.702306e-77  1.311517e-80
177  1.000000e+00 3.163518e-101  3.495546e-46
178  1.000000e+00  3.054452e-88  1.327207e-76
180  1.000000e+00 1.724005e-125  3.039510e-76
181  1.000000e+00 3.604353e-115  2.263329e-40
182  1.000000e+00 3.118941e-135  1.353999e-67
183  1.000000e+00 7.598731e-100  9.617386e-71
184  1.000000e+00  1.264204e-73  3.452318e-56
185  1.000000e+00 6.903902e-103  9.568626e-57
186  1.000000e+00 4.939525e-210  2.816545e-50
187  1.000000e+00 3.582723e-134  7.603848e-39
188  1.000000e+00 1.878878e-100  8.760223e-78
189  1.000000e+00  4.506500e-88  2.221559e-52
190  1.000000e+00  5.736024e-45  2.434454e-95
191  1.000000e+00  4.502070e-80  3.721294e-46
192  1.000000e+00 7.073412e-113  2.117083e-81
193  1.000000e+00  5.134575e-52  8.682756e-47
194  1.000000e+00 9.195023e-126  3.007258e-71
195  1.000000e+00  1.487214e-87  2.630540e-41
196  1.000000e+00 9.689026e-107  2.712435e-62
197  1.000000e+00 4.463268e-130  5.531538e-69
198  1.000000e+00  5.495591e-91  1.539825e-47
199  1.000000e+00  1.805392e-82  2.836442e-41
200  1.000000e+00 1.028279e-123  2.594745e-65
201  1.000000e+00 7.028607e-120  1.460173e-44
202  1.000000e+00  2.064603e-77  6.387252e-86
203  1.000000e+00 3.070581e-112  2.416780e-46
204  1.000000e+00 8.443716e-131  7.699229e-61
205  1.000000e+00  5.034473e-79  8.759494e-48
206  1.000000e+00 1.247832e-118  2.619672e-56
207  1.000000e+00 1.046291e-108  3.826834e-43
208  1.000000e+00  4.093166e-71  1.731317e-77
209  1.000000e+00  6.860213e-72  2.136910e-48
210  1.000000e+00  1.269283e-79  8.616304e-73
211  1.000000e+00  2.750180e-63  1.025108e-55
212  1.000000e+00 7.667920e-138  1.981604e-63
213  1.000000e+00  1.118393e-82  1.219461e-42
214  1.000000e+00  1.993765e-98  3.966097e-72
215  1.000000e+00  6.105472e-91  1.731410e-39
216  1.000000e+00 4.647250e-168  4.056522e-51
217  1.000000e+00  1.385310e-97  2.832574e-40
218  1.000000e+00 2.621620e-114  1.352348e-72
220  1.000000e+00 8.632477e-125  8.344145e-71
221  1.000000e+00  2.591495e-77  7.813423e-46
222  1.000000e+00 3.248586e-145  3.191987e-61
223  1.000000e+00 1.311400e-104  1.558088e-43
224  1.000000e+00  3.566976e-78  7.591760e-74
225  1.000000e+00  1.138779e-97  8.241421e-70
226  1.000000e+00 4.596155e-114  9.416621e-49
227  1.000000e+00  2.254524e-91  1.187537e-46
228  1.000000e+00 9.484012e-120  4.411945e-71
229  1.000000e+00 8.464504e-111  2.395072e-42
230  1.000000e+00 8.286447e-147  7.570869e-76
231  1.000000e+00 3.488978e-101  1.392094e-42
232  1.000000e+00  2.690618e-91  4.928889e-90
233  1.000000e+00 1.674415e-120  5.032080e-38
234  1.000000e+00 1.810814e-148  3.469254e-61
235  1.000000e+00 5.692476e-108  2.484898e-44
236  1.000000e+00 1.130152e-117  5.371048e-67
237  1.000000e+00  3.160061e-99  1.046213e-45
238  1.000000e+00 4.235316e-112  4.820883e-74
239  1.000000e+00  4.723465e-70  3.696940e-48
240  1.000000e+00 3.876329e-154  2.180304e-55
241  1.000000e+00 1.269662e-123  3.931959e-41
242  1.000000e+00 1.245313e-125  2.493731e-66
243  1.000000e+00 4.957267e-110  2.215446e-44
244  1.000000e+00 1.002205e-119  6.243525e-67
245  1.000000e+00  7.196249e-94  4.540894e-49
246  1.000000e+00 1.071104e-121  1.507151e-72
247  1.000000e+00  1.726970e-85  1.237235e-52
248  1.000000e+00 1.570425e-121  1.850985e-60
249  1.000000e+00  1.501573e-99  3.577315e-70
250  1.000000e+00 4.034356e-107  2.046856e-39
251  1.000000e+00 6.894741e-118  3.942318e-46
252  1.000000e+00 3.637413e-114  1.380370e-66
253  1.000000e+00 9.974536e-115  5.983107e-40
254  1.000000e+00 4.214496e-161  7.208930e-58
255  1.000000e+00 1.839915e-101  2.583646e-51
256  1.000000e+00 2.884899e-128  2.469645e-61
258  1.000000e+00  1.986041e-94  1.689584e-85
259  1.000000e+00  2.984444e-56  4.996325e-62
260  1.000000e+00 1.247952e-155  3.919568e-62
261  1.000000e+00  1.782553e-76  5.280700e-53
262  1.000000e+00 3.314837e-122  2.323582e-79
263  1.000000e+00 1.677925e-135  1.493023e-39
264  1.000000e+00 1.198820e-137  3.330126e-69
265  1.000000e+00  8.029180e-62  6.684889e-58
266  1.000000e+00 4.356844e-129  1.647240e-62
267  1.000000e+00  1.150705e-90  5.117224e-37
268  1.000000e+00 2.544552e-178  1.158961e-53
270  1.000000e+00 7.893047e-128  6.341904e-80
271  1.000000e+00 5.236358e-127  9.840445e-39
273  1.000000e+00 2.258098e-111  1.118938e-42
274  1.000000e+00 7.780342e-140  1.175677e-69
275  1.000000e+00 7.922007e-104  3.689074e-56
276  1.000000e+00 4.308239e-118  1.367370e-77
277  9.383743e-73  4.229444e-53  1.000000e+00
278  2.414257e-46  6.692323e-33  1.000000e+00
279  4.687592e-52  1.229642e-45  1.000000e+00
280  1.048175e-60  3.444186e-21  1.000000e+00
281  5.469722e-54  1.129503e-52  1.000000e+00
282  2.241766e-70  1.074536e-56  1.000000e+00
283  4.593230e-61  5.951872e-27  1.000000e+00
284  2.288727e-61  5.338912e-73  1.000000e+00
285  1.432389e-60  1.726689e-45  1.000000e+00
286  2.039038e-50  3.258510e-34  1.000000e+00
287  9.109742e-72  1.098026e-65  1.000000e+00
288  6.685114e-45  7.275545e-30  1.000000e+00
289  3.123457e-71  3.729335e-74  1.000000e+00
290  2.497980e-64  4.169730e-94  1.000000e+00
291  1.295953e-74  4.031413e-65  1.000000e+00
292  1.838294e-49  1.795872e-43  1.000000e+00
293  7.633018e-50  2.033881e-08  1.000000e+00
294  7.713006e-95 5.573962e-187  1.000000e+00
295  2.164057e-66  8.161922e-34  1.000000e+00
296  4.115408e-48  1.364831e-66  1.000000e+00
297  1.978714e-56  2.528745e-16  1.000000e+00
298  2.777610e-57  4.234493e-43  1.000000e+00
299  1.206631e-74  3.632233e-24  1.000000e+00
300  2.515493e-47  1.752889e-37  1.000000e+00
301  1.966260e-77  2.935052e-51  1.000000e+00
302  6.646101e-54  9.510346e-75  1.000000e+00
303  3.208245e-87  2.559884e-89  1.000000e+00
304  1.575148e-52  1.308607e-37  1.000000e+00
305  1.340750e-70  2.068844e-59  1.000000e+00
306  2.051777e-51  1.462469e-77  1.000000e+00
307  1.418427e-65  1.883947e-06  9.999981e-01
308  9.302754e-49  2.214646e-66  1.000000e+00
309  6.913968e-68  6.639889e-27  1.000000e+00
310  1.217067e-57  1.420780e-69  1.000000e+00
311  6.092297e-54  2.616066e-40  1.000000e+00
312  4.367672e-85 1.831430e-104  1.000000e+00
313  5.181682e-72  1.506746e-68  1.000000e+00
314  9.562039e-46  9.723739e-63  1.000000e+00
315  1.754656e-90  1.469712e-63  1.000000e+00
316  1.634584e-54  2.249546e-86  1.000000e+00
317  7.277925e-54  1.567397e-30  1.000000e+00
318  1.370821e-69  3.584048e-60  1.000000e+00
319  6.936680e-55  4.964565e-42  1.000000e+00
320  1.631335e-79  7.066499e-64  1.000000e+00
321  2.551698e-86 8.374298e-111  1.000000e+00
322  1.666233e-54  5.578999e-83  1.000000e+00
323  4.887942e-82  2.089817e-90  1.000000e+00
324  1.767910e-51  1.671757e-39  1.000000e+00
325  3.012468e-55  3.049238e-44  1.000000e+00
326  4.008165e-89 1.181851e-112  1.000000e+00
327  7.332885e-95 1.178048e-103  1.000000e+00
328  3.782168e-57  2.803699e-64  1.000000e+00
329  1.058241e-74  9.870156e-61  1.000000e+00
330  7.903304e-51  1.246369e-44  1.000000e+00
331  1.368628e-63  1.565206e-13  1.000000e+00
332  2.730994e-62  2.762055e-61  1.000000e+00
333  5.220568e-80  2.129608e-59  1.000000e+00
334  1.514945e-45  4.068005e-24  1.000000e+00
335  2.029049e-57  3.448219e-51  1.000000e+00
336  7.675421e-61  3.661378e-11  1.000000e+00
337  8.942477e-59  1.327385e-61  1.000000e+00
338  6.212078e-80  1.959542e-90  1.000000e+00
339  6.325469e-78  5.897002e-70  1.000000e+00
340  1.160711e-67 1.231273e-101  1.000000e+00
341  1.194972e-71  8.123202e-17  1.000000e+00
342  2.133439e-53  4.893635e-52  1.000000e+00
343  5.050237e-61  1.191538e-66  1.000000e+00
344  2.773257e-79  6.303744e-89  1.000000e+00

Exercise

Use the Publication data from ISLR2.

Randomly split the data into an 80% training set and a 20% test set.

Fit a multinomial logistic model to classify the variable mech.

Use the test data to predict the mech variable. See if it is a reasonable fit.

Test Error Rate

What test error rate did you get in the previous exercise?

\[Ave(I(y_0 \neq \hat y_0))\]
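The average above is simply the share of test observations that are misclassified. The sketch below shows the mechanics of a random 80/20 split and of the test error rate on simulated data with a binary response and a plain logistic regression. The data, the variable names and the 0.5 threshold are assumptions for illustration; this is not a solution to the exercise.

```r
set.seed(1)

# simulated stand-in data with a binary response
n <- 1000
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(
  ifelse(runif(n) < plogis(0.5 + 1.5 * sim$x1 - sim$x2), "Yes", "No")
)

# random 80/20 split
train_rows <- sample(seq_len(n), size = 0.8 * n)
train <- sim[train_rows, ]
test  <- sim[-train_rows, ]

# fit on the training data only
fit <- glm(y ~ x1 + x2, data = train, family = binomial)

# predicted class for the held-out observations (threshold 0.5)
p_hat <- predict(fit, newdata = test, type = "response")
y_hat <- ifelse(p_hat > 0.5, "Yes", "No")

# test error rate: share of test observations that are misclassified
test_error <- mean(y_hat != test$y)
test_error
```

The model never sees the test rows while it is being fitted, which is what makes this an estimate of how it would do on new data.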

The Bayes Classifier

“a classifier that assigns each observation to the most likely class, given its predictor values” minimizes the test error rate, on average.

  • This lowest error rate is called the Bayes error rate.

  • The boundary at which two classes are equally likely is the Bayes decision boundary.

  • Why not always use the Bayes classifier?

Moving Forward

Keep in mind the good old Bayes Rule

\[P(A|B) = \frac{P(B|A)* P(A)}{P(B)}\]

Generative Models - What ?

  • We saw that the logistic model estimates \(Pr(Y=k|X=x)\) directly.
  • Alternatively, we model the distribution of the predictors \(X\) separately within each class of \(Y\).
  • Then we use the Bayes rule to get \(Pr(Y=k|X=x)\).
  • “When the distribution of X within each class is assumed to be normal, it turns out that the model is very similar in form to logistic regression”

Generative Models - Why?

  • When the separation between the two classes is substantial, the parameter estimates of logistic regression are unstable.
  • When the distribution of \(X\) within each class of \(Y\) is approximately normal and the sample size is small, these methods can be more accurate than logistic regression.

Generative Models - How?

\(\pi_k\) is the prior probability that a randomly chosen observation comes from the \(k^{th}\) class of the response.

\(f_k(x) = Pr(X = x|Y=k)\) is the density function of \(X\) for an observation from the \(k^{th}\) class.

\[Pr(Y=k|X=x) = \frac{\pi_kf_k(x)}{\sum_{l=1}^K\pi_lf_l(x)}\]

We are trying to approximate the Bayes classifier! We will explore linear discriminant analysis, quadratic discriminant analysis and naive Bayes.
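The posterior formula can be evaluated directly once priors and class densities are available. The sketch below assumes two classes, one predictor and normal class densities. The priors, means and standard deviations are made-up values, and posterior() is a hypothetical helper written for this example.

```r
# Hypothetical two-class setup with one predictor
prior <- c(No = 0.7, Yes = 0.3)        # pi_k
mu    <- c(No = 0, Yes = 2)            # class means (assumed)
sigma <- c(No = 1, Yes = 1.5)          # class standard deviations (assumed)

# posterior() is a hypothetical helper: Pr(Y = k | X = x) for each class
posterior <- function(x) {
  f_k <- dnorm(x, mean = mu, sd = sigma)   # f_k(x) for each class
  prior * f_k / sum(prior * f_k)           # Bayes rule
}

posterior(0.5)
posterior(3)
```

An observation close to the mean of the No class gets a posterior dominated by No, while one far to the right is assigned mostly to Yes despite the smaller prior of that class.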

LDA for One predictor

The overarching goal is to estimate \(f_k(x)\).

To achieve our goal, we assume that \(f_k(x)\) is normal.

\[f_k(x) = \frac{1}{\sigma_k\sqrt{2\pi}}exp(-\frac{1}{2\sigma_k^2}(x-\mu_k)^2)\]

Here, \(\mu_k\) and \(\sigma_k^2\) are the mean and variance parameters of the \(k^{th}\) class.

We also assume that \(\sigma_1^2 = ... = \sigma_K^2 = \sigma^2\).

LDA for One Predictor

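\[ Pr(Y=k|X=x) = \frac{\pi_k\frac{1}{\sigma\sqrt{2\pi}}exp(-\frac{1}{2\sigma^2}(x-\mu_k)^2)}{\sum_{l=1}^K\pi_l\frac{1}{\sigma\sqrt{2\pi}}exp(-\frac{1}{2\sigma^2}(x-\mu_l)^2)} \]

Taking the log and dropping the terms that do not depend on \(k\) gives the discriminant function. We assign an observation to the class for which it is largest:

\[ \delta_k(x) = x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2} + log(\pi_k) \]

For \(K = 2\) and \(\pi_1 = \pi_2\), the decision boundary is the point where \(\delta_1(x) = \delta_2(x)\):

\[ x = \frac{\mu_1^2-\mu_2^2}{2(\mu_1-\mu_2)}= \frac{\mu_1 + \mu_2}{2} \]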
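The expression \(x\mu_k/\sigma^2 - \mu_k^2/(2\sigma^2) + log(\pi_k)\) above can be evaluated as a score for each class, and with equal priors the two scores should meet at the midpoint of the class means. The sketch below checks this numerically. The means, the shared variance and the helper name delta are made-up values for illustration.

```r
# Hypothetical parameters for K = 2 classes sharing one variance
mu     <- c(class_1 = -1, class_2 = 2)
sigma2 <- 1.5
prior  <- c(class_1 = 0.5, class_2 = 0.5)   # equal priors

# delta() is a hypothetical helper: the score of each class at a point x
delta <- function(x) {
  x * mu / sigma2 - mu^2 / (2 * sigma2) + log(prior)
}

boundary <- sum(mu) / 2    # (mu_1 + mu_2) / 2
delta(boundary)            # the two scores coincide at the boundary

# on either side, pick the class with the larger score
names(which.max(delta(boundary - 1)))
names(which.max(delta(boundary + 1)))
```

With unequal priors the log(prior) term shifts the boundary away from the midpoint, towards the mean of the less likely class.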

Applying LDA

lda_default_balance_student <-
  MASS::lda(default ~ balance + student, data = Default)
lda_default_balance_student
Call:
lda(default ~ balance + student, data = Default)

Prior probabilities of groups:
    No    Yes 
0.9667 0.0333 

Group means:
      balance studentYes
No   803.9438  0.2914037
Yes 1747.8217  0.3813814

Coefficients of linear discriminants:
                    LD1
balance     0.002244397
studentYes -0.249059498

Applying LDA - training error rate

mean(
  predict(lda_default_balance_student,
          newdata = Default)$class != Default$default
)
[1] 0.0275
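  • This is the training error rate; the test error rate is usually higher.

  • Compare it with the trivial null classifier: always predicting No would be wrong for only 3.33% of the training observations (the prior probability of Yes above).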

predict(lda_default_balance_student,
          newdata = Default)|>names()
[1] "class"     "posterior" "x"        
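The class and posterior elements returned by predict() are enough to build a confusion matrix and to try a different probability threshold. The sketch below does so on simulated data with a rare Yes class, standing in for Default so that it needs only MASS. The data, the 0.2 threshold and the object names are assumptions for illustration.

```r
library(MASS)   # for lda(); MASS ships with standard R installations

set.seed(7)

# simulated stand-in data: rare "Yes" class, one predictor
n_no <- 950
n_yes <- 50
sim <- data.frame(
  y = factor(rep(c("No", "Yes"), times = c(n_no, n_yes))),
  x = c(rnorm(n_no, mean = 0), rnorm(n_yes, mean = 2))
)

fit  <- lda(y ~ x, data = sim)
pred <- predict(fit, newdata = sim)

# confusion matrix at the default 0.5 posterior threshold
conf_default <- table(predicted = pred$class, actual = sim$y)
conf_default

# lower the threshold to 0.2 to catch more of the rare class
class_02 <- factor(ifelse(pred$posterior[, "Yes"] > 0.2, "Yes", "No"),
                   levels = c("No", "Yes"))
conf_02 <- table(predicted = class_02, actual = sim$y)
conf_02

# training error of LDA versus the null classifier that always says "No"
c(lda = mean(pred$class != sim$y), null = mean(sim$y != "No"))
```

Lowering the threshold trades more false alarms for fewer missed Yes cases, which is why the overall error rate alone can be a misleading summary when one class is rare.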

Exercise

  • See the OJ data set in ISLR2.

  • Use this data set to predict the variable Purchase.

  • Split the data into 80/20 training and test sets.

  • Use the training data to develop an LDA model. Use the ROC curve and the confusion matrix to gauge model effectiveness. Fine-tune the model. See chapter 9 of Tidy Modeling with R (TMWR).

  • Predict the test data with the fine-tuned model.

QDA

Quadratic Discriminant Analysis

  • QDA too assumes that the observations within each class are drawn from a Gaussian distribution.
  • However, QDA does not assume a common covariance matrix: each class has its own. This is where it differs from LDA.
  • As a result, \(x\) appears as a quadratic function in the discriminant function.
  • Now \(Kp(p+1)/2\) parameters need to be estimated for the covariance matrices instead of \(p(p+1)/2\). This is where the bias-variance trade-off comes into play.
  • LDA is the less flexible of the two, so it has lower variance but can have high bias, especially if the common covariance assumption (\(\sigma_1^2=...=\sigma_K^2\) in the one-predictor case) is badly off.
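To get a feel for how quickly the number of covariance parameters grows, the sketch below evaluates \(p(p+1)/2\) and \(Kp(p+1)/2\). The values p = 50 and K = 3 are arbitrary, and n_cov_params is a hypothetical helper written for this example.

```r
# n_cov_params() is a hypothetical helper: number of covariance
# parameters for K separate p x p covariance matrices
n_cov_params <- function(p, K = 1) K * p * (p + 1) / 2

p <- 50   # illustrative number of predictors
K <- 3    # illustrative number of classes

c(lda = n_cov_params(p), qda = n_cov_params(p, K))
```

QDA has K times as many covariance parameters to estimate from the same data, which is the source of its higher variance.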

Exercise

See the Smarket data in ISLR2.

Split it into 80/20 training and test sets.

Train LDA and QDA models.

Test these models and compare the results using the test error rate.

What happens if you draw n training and test splits, fit LDA and QDA on each pair, and plot the distributions of the training and test error rates?

Naive Bayes

  • From LDA and QDA we have seen that estimating \(\pi_1,...,\pi_K\) is easy.
  • Estimating \(f_1(x),...,f_K(x)\) is difficult.
  • The assumptions of LDA and QDA help us avoid estimating K p-dimensional density functions.
  • The naive Bayes classifier instead makes a single assumption: within the \(k^{th}\) class, the p predictors are independent.

\[f_k(x) = f_{k1}(x_1)*f_{k2}(x_2)*...*f_{kp}(x_p)\]

Naive Bayes

\[Pr(Y=k|X=x) = \frac{\pi_k*f_{k1}(x_1)*f_{k2}(x_2)*...*f_{kp}(x_p)}{\sum_{l=1}^K \pi_l*f_{l1}(x_1)*f_{l2}(x_2)*...*f_{lp}(x_p)}\]

How is \(f_{kj}\) estimated?
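One common option for quantitative predictors is to assume that each \(f_{kj}\) is a univariate normal density, with its mean and standard deviation estimated within class k. The sketch below implements that choice by hand on the built-in iris data, which is used here only because it ships with R. It illustrates the formula and is not the only way to estimate \(f_{kj}\).

```r
# Gaussian naive Bayes by hand on iris (built-in data)
X <- iris[, 1:4]
y <- iris$Species
classes <- levels(y)

# pi_k: share of each class in the data
prior <- setNames(as.numeric(table(y)) / length(y), classes)

# per-class, per-predictor mean and standard deviation
mu    <- sapply(X, function(col) tapply(col, y, mean))
sigma <- sapply(X, function(col) tapply(col, y, sd))

# pi_k * product over j of f_kj(x_j), for every row and class
unnormalised <- sapply(classes, function(k) {
  dens <- mapply(
    function(col, m, s) dnorm(col, mean = m, sd = s),
    X, mu[k, ], sigma[k, ]
  )
  prior[[k]] * apply(dens, 1, prod)
})

posterior <- unnormalised / rowSums(unnormalised)
pred <- classes[max.col(posterior)]

mean(pred != y)   # training error rate
```

Each predictor contributes its own one-dimensional density, so only p means and p standard deviations are needed per class instead of a full p-dimensional density.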

Exercise

Use the naiveBayes() function from the e1071 package.

Use the Smarket data and compare the results with QDA.